NBA Shot Predictor
Oliver Lee
1. Data Collection and Preprocessing
The goal of this project is to train a model to predict the likelihood a shot is made based on a variety of factors including shot location, shot type, player stats, and more. The main data used for training is found here: https://github.com/DomSamangy/NBA_Shots_04_25. This data contains every shot taken in the NBA from 2004-2025, with features such as player, shot type, shot location, etc.
We then merge this data with individual player statistics fetched from the NBA API, as shown below.
import pandas as pd
from tqdm.notebook import tqdm
import time
from nba_api.stats.endpoints import PlayerDashboardByYearOverYear
Define function to fetch stats for a single player
This function uses the NBA API to get field goal %, 3-point %, and minutes played for a given player in the specified season.
def get_player_stats(player_id, season='2024-25'):
    try:
        dash = PlayerDashboardByYearOverYear(player_id=player_id, season=season)
        df = dash.get_data_frames()[1]
        latest_season = df[df['GROUP_VALUE'] == season]
        stats = latest_season[['FG_PCT', 'FG3_PCT', 'MIN']].copy()
        stats['PLAYER_ID'] = player_id
        return stats
    except Exception:
        return None
Load the raw shot data and fetch stats for unique player IDs
original_df = pd.read_csv("./raw_data/NBA_2025_Shots.csv")
unique_ids = original_df['PLAYER_ID'].unique()
print(f"Loaded {len(original_df)} shot records for {len(unique_ids)} unique players.")
all_stats = []
failed_ids = []
for pid in tqdm(unique_ids, desc="Fetching Player Stats"):
    stats_df = get_player_stats(pid)
    if stats_df is not None:
        all_stats.append(stats_df)
    else:
        failed_ids.append(pid)
    time.sleep(0.5)  # Delay to avoid API rate limit
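The NBA API occasionally times out or rejects requests under load, which is what populates failed_ids. One way to recover many of those failures is to retry with exponential backoff. The sketch below is an optional helper (the name `fetch_with_retries` is hypothetical, not part of the notebook above):

```python
import time

def fetch_with_retries(fetch_fn, player_id, max_retries=3, base_delay=1.0):
    """Call fetch_fn(player_id), retrying with exponential backoff on failure.

    Returns the first non-None result, or None after max_retries attempts.
    """
    for attempt in range(max_retries):
        result = fetch_fn(player_id)
        if result is not None:
            return result
        # Back off: base_delay, 2x, 4x, ... before the next attempt
        time.sleep(base_delay * (2 ** attempt))
    return None
```

Inside the loop above, `stats_df = fetch_with_retries(get_player_stats, pid)` would then replace the direct call.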
Merge the fetched stats with the original shot data
We'll combine all player stats, merge them with the original dataframe, then save the results.
if all_stats:
    stats_combined = pd.concat(all_stats, ignore_index=True)
    merged_df = original_df.merge(stats_combined, on='PLAYER_ID', how='left')
    # Preview the merged data
    display(merged_df.head())
    # Save merged data to CSV
    merged_df.to_csv("./merged_data/24_25_allstats.csv", index=False)
    print("Saved merged stats to './merged_data/24_25_allstats.csv'.")
else:
    print("No player stats were retrieved.")

if failed_ids:
    print(f"Failed to fetch stats for {len(failed_ids)} players:")
    print(failed_ids)
else:
    print("Successfully fetched stats for all players.")
This merging process takes quite a while due to the API's rate limiting, but the final merged data will look like this (first 2 rows shown):
| SEASON_1 | SEASON_2 | TEAM_ID | TEAM_NAME | PLAYER_ID | PLAYER_NAME | POSITION_GROUP | POSITION | GAME_DATE | GAME_ID | HOME_TEAM | AWAY_TEAM | EVENT_TYPE | SHOT_MADE | ACTION_TYPE | SHOT_TYPE | BASIC_ZONE | ZONE_NAME | ZONE_ABB | ZONE_RANGE | LOC_X | LOC_Y | SHOT_DISTANCE | QUARTER | MINS_LEFT | SECS_LEFT | FG_PCT | FG3_PCT | MIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2024 | 2023-24 | 1610612764 | Washington Wizards | 1629673 | Jordan Poole | G | SG | 11-03-2023 | 22300003 | MIA | WAS | Missed Shot | False | Driving Floating Jump Shot | 2PT Field Goal | In The Paint (Non-RA) | Center | C | 8-16 ft. | -0.4 | 17.45 | 12 | 1 | 11 | 1 | 0.413 | 0.326 | 2345.555 |
| 2024 | 2023-24 | 1610612764 | Washington Wizards | 1630166 | Deni Avdija | F | SF | 11-03-2023 | 22300003 | MIA | WAS | Made Shot | True | Jump Shot | 3PT Field Goal | Above the Break 3 | Center | C | 24+ ft. | 1.5 | 30.55 | 25 | 1 | 10 | 26 | 0.506 | 0.374 | 2256.6433333 |
2. Training a Random Forest Classifier
import pandas as pd
import numpy as np
import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
Removing Unrelated Features
Some features should be removed before training, as they should have no impact on the shot outcome. We also drop PLAYER_ID here but keep PLAYER_NAME as an easier way to identify each player. The target label y is SHOT_MADE, the outcome we want to predict.
df = pd.read_csv('./raw_data/NBA_2025_Shots.csv')
df = df.drop(columns=['SEASON_2', 'GAME_ID', 'ZONE_ABB', 'EVENT_TYPE', 'GAME_DATE',
                      'PLAYER_ID', 'TEAM_ID', 'TEAM_NAME'])
X = df.drop(columns=['SHOT_MADE', 'PLAYER_NAME'])
y = df['SHOT_MADE'].astype(int)
X_encoded = pd.get_dummies(X)
X_encoded['PLAYER_NAME'] = df['PLAYER_NAME']
X_train, X_test, y_train, y_test = train_test_split(
    X_encoded.drop(columns=['PLAYER_NAME']),  # Remove PLAYER_NAME before training
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
# Recover the player names for the held-out test rows
test_player_ids = X_encoded.loc[X_test.index, 'PLAYER_NAME']
Finally, we train the random forest on the training split and store the model for analysis. For this project, I trained a model specifically on the 24-25 season and tested it on data from previous years.
model = RandomForestClassifier(n_estimators=100, random_state=42, verbose=0)
model.fit(X_train, y_train)
joblib.dump({
    'model': model,
    'test_player_ids': test_player_ids,
    'feature_names': X_train.columns
}, './models/random_forest_24_25.joblib')
['./models/random_forest_24_25.joblib']
3. Testing Model Performance
import pandas as pd
import numpy as np
import joblib
Load the stored model, then load shot data from a different season (in this case, applying the 24-25 model to 23-24 data) and drop the same columns from it.
model_data = joblib.load('./models/random_forest_24_25.joblib')
model = model_data['model']
trained_features = model_data['feature_names']
df = pd.read_csv('./raw_data/NBA_2024_Shots.csv')
df = df.drop(columns=['SEASON_2', 'GAME_ID', 'ZONE_ABB', 'EVENT_TYPE', 'GAME_DATE',
                      'PLAYER_ID', 'TEAM_ID', 'TEAM_NAME'])
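Because a different season can contain categorical values the trained model never saw (and can be missing ones it did see), the one-hot columns must be aligned to the training features before predicting; `reindex(columns=trained_features, fill_value=0)` handles this. A toy sketch of that alignment behavior, using made-up column names:

```python
import pandas as pd

# The model was trained on these columns; the new data has zone B and an unseen zone C
train_features = ['SHOT_DISTANCE', 'ZONE_A', 'ZONE_B']
new_data = pd.DataFrame({'SHOT_DISTANCE': [10, 24], 'ZONE': ['B', 'C']})

encoded = pd.get_dummies(new_data, columns=['ZONE'])
# Align to training columns: unseen ZONE_C is dropped, missing ZONE_A is filled with 0
aligned = encoded.reindex(columns=train_features, fill_value=0)
print(aligned)
```

This guarantees the feature matrix has exactly the columns (and column order) the random forest expects.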
Now we can see the model's predictions and search by any desired metrics. For preliminary testing, I created predictions for some individual players, displaying accuracy as well as each example that was incorrectly classified (limited to 5 examples here).
player_name = "Immanuel Quickley"
player_rows = df[df['PLAYER_NAME'] == player_name].copy()
player_y = player_rows['SHOT_MADE'].astype(int)
player_X = player_rows.drop(columns=['SHOT_MADE', 'PLAYER_NAME'])
player_X_encoded = pd.get_dummies(player_X)
player_X_encoded = player_X_encoded.reindex(columns=trained_features, fill_value=0)
predictions = model.predict(player_X_encoded)
probabilities = model.predict_proba(player_X_encoded)[:, 1]
player_rows = player_rows.assign(
    PREDICTED_MADE=predictions,
    PREDICTED_PROB=probabilities
)
correct = np.sum(player_y.values == predictions)
total = len(player_y)
print(f"Correct Predictions: {correct} / {total}")
print(f"Accuracy for {player_name}: {correct / total}")
# find examples that were incorrectly classified
mismatches = player_rows[player_rows['SHOT_MADE'] != player_rows['PREDICTED_MADE']]
print("\nMismatched Predictions (SHOT_MADE != PREDICTED_MADE):")
print(mismatches[[
    'PLAYER_NAME',
    'ACTION_TYPE',
    'SHOT_TYPE',
    'SHOT_DISTANCE',
    'ZONE_NAME',
    'SHOT_MADE',
    'PREDICTED_MADE',
    'PREDICTED_PROB'
]].head(5).to_string())
Correct Predictions: 537 / 894
Accuracy for Immanuel Quickley: 0.6006711409395973
Mismatched Predictions (SHOT_MADE != PREDICTED_MADE):
PLAYER_NAME ACTION_TYPE SHOT_TYPE SHOT_DISTANCE ZONE_NAME SHOT_MADE PREDICTED_MADE PREDICTED_PROB
17577 Immanuel Quickley Jump Shot 3PT Field Goal 25 Left Side Center True 0 0.24
17588 Immanuel Quickley Jump Shot 3PT Field Goal 27 Left Side Center True 0 0.44
17611 Immanuel Quickley Driving Floating Jump Shot 2PT Field Goal 11 Left Side True 0 0.44
17626 Immanuel Quickley Driving Floating Bank Jump Shot 2PT Field Goal 13 Right Side True 0 0.42
17650 Immanuel Quickley Driving Floating Jump Shot 2PT Field Goal 9 Center True 0 0.41
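Since the model outputs a probability for every shot, raw accuracy can be supplemented with a probabilistic score. One simple option (not used in the notebook above, added here as a sketch) is the Brier score, the mean squared error between predicted probability and the 0/1 outcome:

```python
import numpy as np

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 outcome.

    Lower is better; always predicting 0.5 scores 0.25.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# Toy example: three made shots, two misses
outcomes = [1, 1, 0, 1, 0]
probs = [0.7, 0.6, 0.3, 0.55, 0.45]
print(brier_score(outcomes, probs))
```

Applied to the cell above, this would be `brier_score(player_y.values, probabilities)`, which rewards well-calibrated probabilities rather than just the 0.5-thresholded prediction.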
4. Generating Unique Visualizations and Metrics
Expected vs. Actual Points
One statistic I wanted to focus on is the notion of 'shot selection', specifically the proportion of shots a player takes that the model predicts to be a miss or a make. Additionally, we can look at how often a player makes a shot he is predicted to miss (a 'difficult' shot), or vice versa.
X = df.drop(columns=['SHOT_MADE', 'PLAYER_NAME'])
X_encoded = pd.get_dummies(X)
X_encoded = X_encoded.reindex(columns=trained_features, fill_value=0)
df['PREDICTED_PROB'] = model.predict_proba(X_encoded)[:, 1]
df['PREDICTED_MADE'] = model.predict(X_encoded)
point_map = {
    '2PT Field Goal': 2,
    '3PT Field Goal': 3
}
df['SHOT_VALUE'] = df['SHOT_TYPE'].map(point_map)
df['ACTUAL_POINTS'] = df['SHOT_MADE'] * df['SHOT_VALUE']
df['EXPECTED_POINTS'] = df['PREDICTED_PROB'] * df['SHOT_VALUE']
player_summary = df.groupby('PLAYER_NAME').agg(
    total_shots=('SHOT_MADE', 'count'),
    actual_points=('ACTUAL_POINTS', 'sum'),
    expected_points=('EXPECTED_POINTS', 'sum'),
)
player_summary['points_above_expected'] = player_summary['actual_points'] - player_summary['expected_points']
top10 = (
    player_summary
    .sort_values(by='points_above_expected', ascending=False)
    .head(10)
)
print("Top 10 Players by Points Above Expected:")
print(top10.round(2).to_string())
import plotly.express as px
import plotly.io as pio
pio.renderers.default = 'notebook'
player_summary = player_summary.reset_index()
fig = px.scatter(
    player_summary,
    x='expected_points',
    y='points_above_expected',
    hover_name='PLAYER_NAME',
    opacity=0.9,
    labels={
        'expected_points': 'Expected Points',
        'points_above_expected': 'Points Above Expected',
        'total_shots': 'Total Shots'
    },
    title='Player Expected Points vs. Points Above Expected',
    template='simple_white'
)
fig.add_shape(
    type="line",
    x0=player_summary['expected_points'].min(),
    x1=player_summary['expected_points'].max(),
    y0=0,
    y1=0,
    line=dict(color='gray', dash='dot', width=1)
)
fig.update_layout(
    template="seaborn",  # Overrides the 'simple_white' template set above
    font=dict(family="Helvetica", size=14, color='black'),
    title_font=dict(size=22, family="Helvetica", color='black'),
    height=700,
    width=950,
    margin=dict(l=60, r=30, t=70, b=60),
    xaxis=dict(
        showgrid=True,
        gridcolor='rgba(200,200,200,0.2)',
        zeroline=False,
        linecolor='rgba(0,0,0,0.3)'
    ),
    yaxis=dict(
        showgrid=True,
        gridcolor='rgba(200,200,200,0.2)',
        zeroline=False,
        linecolor='rgba(0,0,0,0.3)'
    ),
    hoverlabel=dict(
        bgcolor='white',
        font_size=14,
        font_family='Helvetica'
    )
)
fig.show()
Top 10 Players by Points Above Expected:
total_shots actual_points expected_points points_above_expected
PLAYER_NAME
Luka Doncic 1652 1892 1664.76 227.24
Nikola Jokic 1411 1727 1510.66 216.34
Kevin Durant 1436 1670 1509.19 160.81
Stephen Curry 1445 1657 1504.03 152.97
Kawhi Leonard 1162 1360 1208.93 151.07
Jalen Brunson 1648 1791 1641.07 149.93
Shai Gilgeous-Alexander 1487 1687 1552.75 134.25
CJ McCollum 1055 1207 1073.40 133.61
Kyrie Irving 1131 1297 1164.22 132.78
Paul George 1236 1407 1279.42 127.58
We use Plotly to create an interactive scatterplot: hovering over a dot shows who that player is, their expected points, and their true points above expected. This plot seems to make some sense, as the players with the highest points above expected are generally considered to be among the current top players (Nikola Jokic, Luka Doncic, Kevin Durant, etc.).
One thing to consider is that this plot works better for players who have taken more shots, since it sums raw point differences rather than accounting for shot volume. Only players with a substantial number of attempts will show a meaningful trend.
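One way to reduce that volume bias is to normalize points above expected by attempts and filter out low-volume players. This is a sketch on a toy stand-in for the player_summary frame computed above (the names and threshold are illustrative, not real results):

```python
import pandas as pd

# Toy stand-in for the player_summary frame computed above
player_summary = pd.DataFrame({
    'PLAYER_NAME': ['High Volume', 'Low Volume'],
    'total_shots': [1500, 40],
    'actual_points': [1700, 55],
    'expected_points': [1600.0, 42.5],
})
player_summary['points_above_expected'] = (
    player_summary['actual_points'] - player_summary['expected_points']
)
# A per-shot rate removes the raw-volume advantage...
player_summary['pae_per_shot'] = (
    player_summary['points_above_expected'] / player_summary['total_shots']
)
# ...but a rate on few attempts is noisy, so require a minimum number of shots
qualified = player_summary[player_summary['total_shots'] >= 300]
print(qualified[['PLAYER_NAME', 'pae_per_shot']])
```

The same two lines could be appended to the real player_summary to rank players by efficiency above expectation rather than raw volume.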
Shot Difficulty Proportion
Another metric that can be derived from this model is the proportion of a player's shots deemed "difficult" (P(SHOT_MADE) < 0.5). This shows what kinds of shots specific players like to take, and we can also examine other features, such as the shot clock or shot type, to see which factors tend to produce difficult or easy shots.
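The difficulty proportion described above can be computed directly from the per-shot PREDICTED_PROB column. A minimal sketch on toy data (player names and probabilities are made up):

```python
import pandas as pd

# Toy stand-in for the per-shot dataframe with model probabilities attached
df = pd.DataFrame({
    'PLAYER_NAME': ['A', 'A', 'A', 'B', 'B'],
    'PREDICTED_PROB': [0.3, 0.6, 0.45, 0.7, 0.55],
})
# A shot is 'difficult' when the model gives it under a 50% chance
df['DIFFICULT'] = df['PREDICTED_PROB'] < 0.5
# Mean of the boolean flag per player = proportion of difficult attempts
difficulty = df.groupby('PLAYER_NAME')['DIFFICULT'].mean()
print(difficulty)
```

On the real data, grouping by ACTION_TYPE or quarter instead of PLAYER_NAME would surface which shot contexts the model considers hardest.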